Data Analysis with Python (Pandas)


In the Data Science community, Python is well known for its excellent data manipulation capabilities (e.g. matrix handling). In this chapter, we will review two vital external libraries in this respect (numpy and pandas), which allow scholars to load, manipulate and analyze tabular data. As example data, we will work with the Meertens Tune Collection, specifically the MTC-FS dataset, which consists of 4120 digitally encoded vocal folk songs, both from Onder de groene linde (2503) and from various related written sources (1617).

In this chapter, we will show you some of the basic plotting functionality available in Python. The plots can be embedded in the notebooks by executing the following cell:


In [ ]:
%matplotlib inline

The default colors of matplotlib are not the prettiest. We can change them using:


In [ ]:
import matplotlib
matplotlib.style.use('ggplot')
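
matplotlib ships with a number of predefined styles besides ggplot; the available style names can be listed with:


In [ ]:
matplotlib.style.available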

This chapter will introduce you to the basics of the Python Data Analysis Library Pandas. The library is very well documented and contains just about all the functionality you need to work with tabular data and time-indexed data. We begin by importing the library into the Python workspace:


In [ ]:
import pandas

Reading Data from a CSV / Excel file

Pandas provides a simple method to read CSV files. To read the metadata file of the Meertens Tune Collection, execute the following cell:


In [ ]:
df = pandas.read_csv("data/MTC-FS.csv")

pandas.read_csv returns a DataFrame object, which is one of the most important data structures in Pandas. Let's have a look at this data structure:


In [ ]:
df

The IPython notebooks work seamlessly with Pandas and display a nicely formatted HTML table. We can access the first, say, 5 rows of this table using a slicing operation:


In [ ]:
df[:5]

An equivalent, but more readable, way is to use the .head method:


In [ ]:
df.head()

By default .head returns the first 5 rows of a DataFrame. The method takes one optional argument n with which you can specify how many rows to show:


In [ ]:
df.head(n=10)

The pandas.read_csv method takes many optional arguments that allow you to parse all kinds of CSV files (different encodings, missing values, different separators, etc.). One of the columns in our dataframe contains the date of the recordings. By default, Pandas will parse such columns as string objects. We can specify which columns should be parsed as dates using the following statement:


In [ ]:
df = pandas.read_csv("data/MTC-FS.csv", parse_dates=['date_of_recording'])
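
parse_dates is just one of read_csv's many options. As a sketch of some of the others (the file name, separator and encoding here are hypothetical, not part of the MTC data), a semicolon-separated file in Latin-1 encoding that marks missing values with 'NA' could be read as follows:


In [ ]:
# hypothetical file, for illustration only
example = pandas.read_csv("data/example.csv", sep=';', encoding='latin-1', na_values=['NA'])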

Remember, to quickly access the documentation of a method or function, you can use a question mark, as in:


In [ ]:
pandas.read_csv?

Since we will be using many functions such as .read_csv, it would be convenient not to have to type pandas all the time. Python allows you to import libraries under an alias, as in:


In [ ]:
import pandas as pd

Now we can use pd.read_csv instead of pandas.read_csv:


In [ ]:
df = pd.read_csv("data/MTC-FS.csv", parse_dates=['date_of_recording'])

Accessing Rows & Columns

Pandas DataFrames work much like regular Python dictionaries. You can retrieve the contents of a column as follows:


In [ ]:
df['tunefamily']

If the column name is a valid Python identifier (e.g. it contains no spaces), Pandas also exposes the column as an attribute of the DataFrame, which can be accessed as follows:


In [ ]:
df.tunefamily

To access a particular row of a DataFrame, Pandas provides the .ix indexer, as in:


In [ ]:
df.ix[0]
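
Note that in newer versions of Pandas, .ix has been deprecated (and eventually removed); the equivalent lookups are written with .loc (label-based) and .iloc (position-based). A minimal sketch, assuming the default integer index:


In [ ]:
df.loc[0]   # the row whose index label is 0
df.iloc[0]  # the first row by position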

Q1

Write some code that shows the contents of another column. Tip: you can use df.columns to see which column names are available.


In [ ]:
# insert your code here

We have seen the method .head. What would be the equivalent method to show the last n rows of a DataFrame? Hint: think of a dog. Try printing the last n=20 rows.


In [ ]:
# insert your code here

Selecting Multiple Columns

Pandas allows you to conveniently select multiple columns in a DataFrame. The syntax is the same as with a single column, but here we provide a list of the columns we want to retrieve:


In [ ]:
df[['tunefamily', 'title']]

These objects are reduced DataFrames that behave exactly the same as the original DataFrame. We can slice them as follows:


In [ ]:
df[['tunefamily', 'title']][:5]

Or we first slice the DataFrame and then select the columns we're interested in:


In [ ]:
df[:5][['tunefamily', 'title']]

Counting discrete variables

Most columns in the MTC dataset are categorical. Pandas provides the convenient method .value_counts, which returns how often each value in a column occurs. Let's see that in action:


In [ ]:
df.tunefamily.head()

As you can see, the column tunefamily contains many duplicates. We would like to obtain a frequency distribution of the tune families in the collection. This can be easily done using:


In [ ]:
df.tunefamily.value_counts()

To print only the n most frequent tune families, we can use:


In [ ]:
df.tunefamily.value_counts()[:10]

It gets even better. We can plot these frequency distributions by simply adding .plot!


In [ ]:
df.tunefamily.value_counts()[:10].plot(kind='bar')
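
The kind argument accepts other plot types as well; 'barh', for instance, produces horizontal bars:


In [ ]:
df.tunefamily.value_counts()[:10].plot(kind='barh')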

Q2

Make a bar plot showing the 10 most frequent places where recordings have been made.


In [ ]:
# insert your code here

Make a density plot of the place name frequency distribution. Tip: search for 'density' at: http://pandas.pydata.org/pandas-docs/stable/visualization.html


In [ ]:
# insert your code here

The method .value_counts returns an object called a Series in Pandas. DataFrames consist of columns, which are Series objects. Series have many methods available, such as the .plot method. Other methods include .sum, .mean, and .median:


In [ ]:
tune_counts = df.tunefamily.value_counts()
print(tune_counts.sum())
print(tune_counts.mean())
print(tune_counts.median())

To get a quick summary of a Series object, use the method .describe:


In [ ]:
tune_counts.describe()

Q3

In the next cell, place your cursor behind the dot and hit TAB to retrieve an overview of the many methods available.


In [ ]:
tune_counts.

Write a line of code which computes the cumulative sum of tune_counts and plots it as a line plot. A cumulative sum is a series of partial sums over a given sequence, rather than just the final number in that series (which would be the regular sum). For example, the cumulative sum of [1, 2, 3] is [1, 3, 6].


In [ ]:
# insert your code here

Data selection under conditions

Often we are interested in particular rows of a dataframe that satisfy particular conditions. We might be interested in all recordings that were made in Groningen, for example, or all recordings before 1980. In the previous chapters you have written a lot of code that involved looping over an iterable and, depending on whether some condition held, appending each item in that iterable to a new list. In Pandas people generally avoid looping and prefer to use so-called vectorized operations that take care of the looping internally.

Say we are interested in all records of which the tune family is equal to Drie schuintamboers 1. Normally we would write a loop, such as:

results = []
for _, row in df.iterrows():  # .iterrows yields (index, row) pairs; each row is a Series
    if row.tunefamily == 'Drie schuintamboers 1':
        results.append(row)
print(len(results))

(This loop works, but it is slow and verbose, and we need the special .iterrows method because looping over a DataFrame directly iterates over its column names.) In Pandas you can achieve the same thing (and better!) by using the following single line:


In [ ]:
df[df.tunefamily == 'Drie schuintamboers 1']

Let's break that statement down. The inner statement df.tunefamily == 'Drie schuintamboers 1' tests for each item in the column tunefamily whether it is equal to 'Drie schuintamboers 1'. When we execute that separately, Python returns a Series object with boolean values that are either True (if it was a match) or False (if the item didn't match):


In [ ]:
df.tunefamily == 'Drie schuintamboers 1'

When we index our df with this Series object, we get just the rows where the condition holds.
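
Because True counts as 1, summing such a boolean Series tells us how many rows match the condition:


In [ ]:
(df.tunefamily == 'Drie schuintamboers 1').sum()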


Q4

Write a line of code that returns a new dataframe in which all recordings are from Groningen.


In [ ]:
# insert your code here

What is the tunefamily that is most often attested in Amsterdam? (Tip: remember the value_counts method you can call on (filtered) DataFrames. Also, Series objects have a method called argmax (idxmax in newer versions of Pandas) that returns the index label corresponding to the maximum value.)


In [ ]:
# insert your code here

We can easily select data on the basis of multiple conditions. Here is an example:


In [ ]:
in_Assen = df.place_of_recording == 'Assen'
is_Suze = df.tunefamily == 'Suze mien lam'
df[in_Assen & is_Suze]
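
The & operator expresses AND; similarly, | expresses OR and ~ negation. When the conditions are written inline, each needs its own parentheses, because & and | bind more tightly than ==:


In [ ]:
# recordings made either in Assen or in Groningen
df[(df.place_of_recording == 'Assen') | (df.place_of_recording == 'Groningen')]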

Just to show you that the returned object is another DataFrame, we select two columns to construct a reduced DataFrame:


In [ ]:
df[in_Assen & is_Suze][['filename', 'place_of_recording']]

Q5

When we look at all values in df.place_of_recording, we notice that many values are missing (NaN values):


In [ ]:
df.place_of_recording
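
In fact, we can count the missing values by summing the boolean Series that the .isnull method (explained below) returns:


In [ ]:
df.place_of_recording.isnull().sum()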

We can use the method .isnull (e.g. on a column) to select all items that are NaN. Similarly, the method .notnull selects all items that are not NaN. Write a line of code that returns a new DataFrame in which for all records the place of recording is known.


In [ ]:
# insert your code here

TimeSeries

Pandas is a library with tremendous functionality for tabular data. Another key feature of the library is the way it can work with time series. The MTC contains a column date_of_recording which provides, for most but not all items, the date at which the recording was made. In the following analyses, we would like to work with data for which the date of recording is known. We create the corresponding DataFrame as follows:


In [ ]:
df = df[df.date_of_recording.notnull()]

To conveniently work with time series, we set the index of the dataframe (i.e. the labels of the rows) to hold the contents of the column date_of_recording. Note that after this operation, date_of_recording is no longer a regular column of the dataframe; it is only available as the index.


In [ ]:
df = df.set_index("date_of_recording")

In [ ]:
df.head()

The Index of our new df is now of the type DatetimeIndex. (If it is not, the CSV was not read in with the argument parse_dates=['date_of_recording'].)


In [ ]:
type(df.index)
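
One convenience of a DatetimeIndex is partial string indexing with .loc: selecting with just a year string returns every row from that year. A sketch, assuming (hypothetically) that the collection contains recordings from 1957:


In [ ]:
# substitute any year that actually occurs in the data
df.loc['1957'].head()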

This data structure provides a vast number of methods and functions to manipulate and index the data. For example, we can retrieve the year of each recording as follows:


In [ ]:
df.index.year

Or the day:


In [ ]:
df.index.day

Or the week:


In [ ]:
df.index.week

Q6

Write a single line of code which returns a new DataFrame consisting of all recordings made on January first (independent of the year). Tip: type df.index. followed by TAB to scan through an overview of all the available methods.


In [ ]:
# insert your code here

Adding a column to a DataFrame

Pandas makes it really easy to add new columns to a DataFrame. In the following cell we add a column containing the week number in which the recording was made.


In [ ]:
df['week'] = df.index.week

In [ ]:
df.week[:10]

Q7

Add a column year to the DataFrame which contains for each record the year of recording.


In [ ]:
# insert your code here

Grouping data

Often we would like to group our data on the basis of some criterion and perform operations on those groups. We could, for example, group all recordings on the basis of where they were recorded. Pandas provides the convenient (though sometimes challenging) method .groupby. It works as follows:


In [ ]:
grouped = df.groupby('place_of_recording')

In [ ]:
grouped

A DataFrameGroupBy object consists of a number of smaller DataFrames, one for each group. We grouped the original dataframe on the key place_of_recording, and therefore the grouped object contains a DataFrame for each unique place of recording.
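
Since each group is itself a DataFrame, a single group can be pulled out with .get_group; here for Assen, assuming (as above) that Assen occurs as a place of recording:


In [ ]:
grouped.get_group('Assen').head()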

We can use the method .size to assess the size (i.e. number of records) of each group:


In [ ]:
grouped.size()

When we add the method .sort_values and specify that we want the sorting to be done in descending order, we obtain the following Series:


In [ ]:
grouped.size().sort_values(ascending=False)

Q8

Just to recap, compute the max, min, mean, and median number of items per group.


In [ ]:
# insert your code here

Grouped objects are extremely helpful when we want to compute a frequency distribution for each unique item of a particular column. Say we would like to investigate the frequency distribution of tune families per place of recording. Given that we have created a grouped object with place_of_recording as key, all we need to do is the following:


In [ ]:
grouped.tunefamily.value_counts()

Let's make it slightly more complex. In the previous example we created a grouped object using a single key. We can also specify multiple keys, as in:


In [ ]:
grouped = df.groupby(['place_of_recording', 'singer_id_s'])

This produces the following data structure:


In [ ]:
grouped.tunefamily.value_counts()
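
The result is a Series with a three-level index: place, singer, and tune family. Note that .size works on multi-key groups as well, counting the records per place/singer pair:


In [ ]:
grouped.size()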

Q9

Make a plot that visualizes for each year how many recordings were made. Tip: first create a grouped object.


In [ ]:
# insert your code here

String operations

One of my favourite functionalities of Pandas is the way it allows you to do vectorized string operations. These operations can be useful if you want to select particular string cells in your dataframe or when you want to manipulate / correct the data. Let's have a look at some of the string data in the MTC:


In [ ]:
firstlines = df.firstline

In [ ]:
firstlines[:10]

Now, firstlines is a Series corresponding to the firstline column, which contains strings. We can access the string methods of a column with .str. In the following cell, place your cursor after the last dot and hit TAB:


In [ ]:
firstlines.str.

As you can see, almost all regular string operations are available. The neat thing is that if you call a method like .upper, the method will be automatically applied to all items in the series (these, too, are vectorized operations). Let's see that in action:


In [ ]:
firstlines.str.upper().head()
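
Another example is .len, which returns the length of each first line:


In [ ]:
firstlines.str.len().head()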

The method .contains is particularly useful if you want to extract all strings that contain a particular substring. We could, for example, search for the string boom (Dutch for 'tree') to extract all records that mention a tree in the first line:


In [ ]:
firstlines.str.contains('boom').head()

The method .contains returns a boolean series which we can use to index the dataframe:


In [ ]:
df[firstlines.str.contains('boom')]

To give another example, let's search for all records of which the first line mentions the word zomer (summer):


In [ ]:
is_summer = firstlines.str.contains('zomer')

In [ ]:
is_summer.head()

As you can see, this returns a boolean Series object. Note that the original DatetimeIndex is still in place. We can use that to conveniently plot all occurrences over the years:


In [ ]:
is_summer.plot()

When we index firstlines with is_summer we retrieve a new Series object consisting of all records that mention the word zomer in their first line:


In [ ]:
summer_lines = firstlines[is_summer]
summer_lines

Q10

Write some code that computes in which quarter of the year most songs about summer are recorded. Tip: use the .index attribute and a Counter object from the collections module.


In [ ]:
# insert your code here

Exercises

You should now have a basic understanding of some important concepts in the Pandas library. The library contains many more interesting functions and methods, and it takes quite some time to master them all. The best way to become more confident in using the library is by working with your own data and trying to do some basic analyses.

The following exercises give you some more practice. Some problems will be tough, and you cannot always rely on the information provided in this chapter. This is something you'll have to get used to. Don't worry: visit the documentation website (see http://pandas.pydata.org/pandas-docs/version/0.17.0/), Google for particular problems, or ask questions on the Slack channel. Asking for help is all part of learning a new library.

In the following exercises we begin by working on the Dative alternation dataset of Bresnan et al. (2007). In English, the recipient can be realized either as a Noun Phrase (NP), John gave Mary the book, or as a Prepositional Phrase (PP), John gave the book to Mary. Bresnan et al. (2007) studied several factors that influence the realization of the dative. The complete dataset of Bresnan et al. (2007) is included in the data directory under data/dative.csv.

Use Pandas to read the dataset as a DataFrame:


In [ ]:
dative = # insert your code here

Inspect the first 10 lines of the dataframe:


In [ ]:
# insert your code here

The dataset provides for each entry the verb that was used. Create a Series object that counts how often each verb is used:


In [ ]:
# insert your code here

Plot the ten most frequent verbs as a pie chart:


In [ ]:
# insert your code here

The realization of the recipient (NP or PP) is stored in the column RealizationOfRecipient. How many NP realizations are there and how many PP realizations?


In [ ]:
# insert your code here

The recipients can be either animate or inanimate. How many PP realizations are inanimate? Tip: use pd.crosstab.


In [ ]:
# insert your code here

Make a single bar plot that visualizes for both PP and NP realizations of the recipient how often they are animate or inanimate. Tip: use the pd.crosstab function from the previous exercise.


In [ ]:
# insert your code here

Sometimes it is convenient to sort a dataframe on a particular column. Write some code that sorts the dative dataframe on the realization of the recipient.


In [ ]:
# insert your code here

Now sort the data on the length of the recipient in descending order:


In [ ]:
# insert your code here

Write some code that prints the mean length of the recipient for both animate and inanimate recipients.


In [ ]:
# insert your code here

Make a boxplot that visualizes for each realization of the recipient (i.e. NP or PP) the length of the theme. Tip: make use of the method dative.boxplot.


In [ ]:
# insert your code here